[WIP] Support for indexing the codebase #6762

voodoos · 2022-12-22T15:04:20Z

This PR is still a work in progress and is part of an effort to support project-wide occurrences when working with OCaml.

The current plan to support project-wide occurrences has three distinct parts:

A new tool that can index values in a cmt: ocaml-uideps
New Dune rules that orchestrate the calls to ocaml-uideps to generate an index of an entire project (this PR)
Merlin / Editor mode support to query that index

The heart of this PR is the uideps.ml module, that exposes two functions, cctx_rules and project_rule, that respectively provide rules to generate indexes for each compilation context and for the global project.

The generation of the indexes can be summarized as:

Index every CMT file in the project
For each lib and exe stanza, aggregate their cmt's indexes
Aggregate all these indexes into the projects's global index

                                        ──┐
              project.uideps              │
                    ▲                     │
                    │                     │
          ┌─────────┴─────────┐           │
          │                   │           │ "project rule"
          │                   │           │
          │                   │           │
                                          │
     cctx.uideps         cctx.uideps    ──┤
                                          │
          ▲                   ▲           │
          │                   │           │
     ┌────┴───┐               │           │
     │        │               │           │
     │        │               │           │
 a.uideps  b.uideps      main.uideps      │   "ctx rules"
                                          │
     ▲       ▲                ▲           │
     │       │                │           │
┌────┼───────┼────┐     ┌─────┼────┐      │
│    │       │    │     │     │    │      │
│  a.cmt    b.cmt │     │ main.cmt │      │
│  a.ml     b.ml  │     │ main.ml  │      │
│                 │     │          │    ──┘
│       Lib       │     │    Exe   │
└─────────────────┘     └──────────┘

Here a a few of the points that I think would be worth investigating / discussing:

Indexing is, in essence, a slow process. There is a lot of room for performance improvement right now, and that is the main reason this PR is a draft. However it is to be expected that even after optimization, the indexing process might remains too slow to be considered a standard Dune target, especially on big projects. I can see several options:
- We could remove this target from the @all alias. That means the user must knowingly trigger the index build.
- Dune could treat progress distinctly: tell that the actual build is over but indexing is still in progress.
I used a new "syntax extension" (with no actual syntax) to make the feature opt-in, but a dune-project option might be more suitable.
To generate the "project" rule that merges indexes from all cctxes, I rely on Dune_file.fold_stanzas. @emillon told me that it might be problematic. @rgrinberg do you have more knowledge about this ?

Maybe we could have that discussing during on of Dune's dev meeting when people will be back from holidays.

rgrinberg · 2022-12-26T14:39:56Z

I'm worried about how this design produces a lot of intermediate artifacts that aren't always going to be useful. For example, the user might trigger a few recompilation of the project before actually using the symbols occurrences" feature. Did you explore a design where theses indices can be produced on demand?

How big are these index files?

I used a new "syntax extension" (with no actual syntax) to make the feature opt-in, but a dune-project option might be more suitable.

Does that mean this syntax extension msut be enabled in every project where this featured is to be used? That seems like a painful UI to me. IMO it should be be a workspace or a dune-config option.

To generate the "project" rule that merges indexes from all cctxes, I rely on Dune_file.fold_stanzas. @emillon told me that it might be problematic. @rgrinberg do you have more knowledge about this ?

Yes, that's "slow" but given the proposed design, I don't see a better way to do this.

voodoos · 2022-12-29T10:46:41Z

I'm worried about how this design produces a lot of intermediate artifacts that aren't always going to be useful. For example, the user might trigger a few recompilation of the project before actually using the symbols occurrences" feature. Did you explore a design where theses indices can be produced on demand?

I did not explore any "on demand" design yet, I wanted to try and see if it would be sustainable to keep the index up-to-date in watch mode for small / medium size projects. And I think it is right now.

An hybrid model would probably work best:

Allow users to enable always-on indexing if they want (via an option or simply the use of the correct alias)
Add a mechanism so that the editors plugins know if the index is up to date and can ask for a rebuild if it is not (probably via the RPC for OCaml-LSP).

How big are these index files?

On Dune itself, the final index is 5.0M but the total size with all the intermediaries is 53M. (for reference, the total size of the cmt files is 50M).
It is expected that the total size is higher than the final index because there is some redundancy and sharing is worst on individual files. This redundancy is necessary to get fast incremental builds.

rgrinberg · 2022-12-30T03:49:24Z

Add a mechanism so that the editors plugins know if the index is up to date and can ask for a rebuild if it is not (probably via the RPC for OCaml-LSP).

Will we need the entire index to do any queries? E.g. if I want to search for uses of a function that is private to a library, do I need an index for the entire workspace?

I wanted to try and see if it would be sustainable to keep the index up-to-date in watch mode for small / medium size projects. And I think it is right now.

The issue isn't only about size and performance, although those matter too. Suppose that the user makes an occurrence request while the index is still building, what's the failure mode? Do we tell the user to retry later? Do we delay his request until all the indexes are built?

It would seem more sensible to me for merlin to ask dune to build whatever index file it needs as it's serving a query.

In any case, this is already good. I'd just like to us to try and improve on the UX we already have with merlin/lsp rather than rely on our usual reliably but relatively awkward patterns.

By the way, where can I find the corresponding merlin code that looks up these indices?

voodoos · 2022-12-30T10:08:26Z

Will we need the entire index to do any queries? E.g. if I want to search for uses of a function that is private to a library, do I need an index for the entire workspace?

With the current implementation the entire index is needed. But it is an interesting perspective! Not sure how we should do it however.

The issue isn't only about size and performance, although those matter too. Suppose that the user makes an occurrence request while the index is still building, what's the failure mode? Do we tell the user to retry later? Do we delay his request until all the indexes are built?

It would seem more sensible to me for merlin to ask dune to build whatever index file it needs as it's serving a query.

In any case, this is already good. I'd just like to us to try and improve on the UX we already have with merlin/lsp rather than rely on our usual reliably but relatively awkward patterns.

Yes I agree that we need to think about the UX, the current implementation is just laying the basis.

By the way, where can I find the corresponding merlin code that looks up these indices?

Just pushed my latest changes here, but it is still very wip:
https://github.com/ocaml/merlin/compare/master...voodoos:merlin:project-wide-occurrences?expand=1

And the small ocaml-lsp changes (also uncleaned):
https://github.com/ocaml/ocaml-lsp/compare/master...voodoos:ocaml-lsp:project-wide-occurrences?expand=1

ELLIOTTCABLE · 2023-01-10T19:33:10Z

Do we tell the user to retry later? Do we delay his request until all the indexes are built?

Tiny observation: I've never used a project-wide-occurrences feature on any sizable project that didn't produce a loading indicator. :P

I suspect this is the type of feature that doesn't necessarily need blazing-fast perf upon invocation to really be fantastically useful; especially if some/most occurrences can be produced quickly while the rest are populated in the background / asynchronously. VSCode even includes a 'refresh' button on the "Find All References" interface, which is very telling:

Correspondingly, as someone who'd use a project-wide reference index at least weekly, I wouldn't personally want the regular build/watch feature slowed down by it, unless it was nearly imperceptible. Plenty of my coworkers would not use such a feature much-if-at-all, and we already have enough build-time issues. 😓

(Hopefully this is useful feedback; this work looks super cool, and I'm planning to check out your branches and give it a spin it against our codebase later today!)

sidkshatriya · 2023-01-11T07:38:39Z

Thanks for working on this. This feature is eagerly awaited.

If a project has hundreds of cmt files, there will be a lot of uideps artifacts. Even if an artifact is quite small, it will occupy a minimum space in the filesystem. Then there is a problem of combining each of these files into an aggregate uideps file. When there are so many files to handle, there is a potential for more things to go wrong.

One way to potentially save space (and perhaps get some more robustness) is to use a single file which is a database. Why not use a single SQLITE db for all the indexing operations and store all your project-wide-occurences data in it? This db would store the content of individual uideps and aggregate uideps.

Depending on how you design the schema, the sqlite db can greatly help with updating the index, querying etc. It may even help you with aggregation and reduce disk space by reducing the amount of redundant data that needs to be stored in each uideps file.

Since there is only one source of truth for the project everything becomes simpler. Since updates to the db can be made transactional, there will never be a scenario in which the project index will be in some inconsistent state when it is being updated. In other words the index is always queryable even when it is slightly outdated.

This is admittedly a large change. This design approach may have already been considered in the past. Just mentioning in case this is appealing.

nojb · 2023-01-11T08:09:28Z

The current plan to support project-wide occurrences has three distinct parts:

Sorry for the naïve question, but could you explain what you mean by project-wide occurrences concretely?

As a data point, we have a typed .cmt-based "grep" tool at LexiFi which one can use to search for structural OCaml terms (eg if (__1 : float) <> __1 then __ else List.length __ matches anything of the form if e1 <> e1 then e2 else List.length e3 where e1 is of type float), and we don't index them, we just scan them for every query. We use this quite a lot (on both Linux and Windows), and is quite usable in terms of performance.

Have you done any benchmark to compare performance with and without the indexing step? (As it seems to add some complexitiy and performance cost.)

Could it make sense to do a first implementation of this feature without the indexing step, and later, if performance is an issue, add the indexing step?

voodoos · 2023-01-11T09:57:56Z

@ELLIOTTCABLE

I suspect this is the type of feature that doesn't necessarily need blazing-fast perf upon invocation to really be fantastically useful

An hybrid model would probably be the best in very large codebases: an asynchronously generated index for the complete codebase and a live re-indexing of the part of the project being worked on. Being aware of the visibility of the definition would also be way to cut the search tree as @rgrinberg suggested.

Correspondingly, as someone who'd use a project-wide reference index at least weekly, I wouldn't personally want the regular build/watch feature slowed down by it, unless it was nearly imperceptible.

I agree that it probably should not be a part of the all target, even when the feature is activated manually via an option in the dune-project file.

@sidkshatriya

Why not use a single SQLITE db for all the indexing operations and store all your project-wide-occurences data in it?

The main pro of the "individual files" method is that it plays well with Dune's (and lot of other build systems) file-based model: this allows the build system to maximize parallelism and incrementality at the cost of bigger artifacts. It also fits closely OCaml separate compilation model so indexing can start when individual cmt files are ready. For cases when the simplicity+time vs size compromise is not viable I plan to add support for different indexing strategies in the indexing tool itself (like calling the tool once for every cmt, or using a database as a store).

@nojb

Sorry for the naïve question, but could you explain what you mean by project-wide occurrences concretely?

By that I mean the ability to select a value in the buffer and display all the usages of that same value across the entire project. So far this only works in the active buffer.

As a data point, we have a typed .cmt-based "grep" tool at LexiFi which one can use to search for structural OCaml terms

That sounds like a great tool! Do you call it on every cmt file in the code-base and it iterates on the whole typed-tree?
I guess it is done in parallel ? I can see two possible differences with value indexing:

We need to iterate on the typedtree and identify the exact definition of every value we found. This is done by reducing "shapes" and is quite expensive even if we memoïze a lot of the computation. So I would guess that it is more expensive than the computation you do.
Iterating on the cmt to look for structural terms has no dependency on other cmts, right? In our case when we reduce shapes we might need to load a lot of other cmt files to find the definition of a value, which is also quite expensive and makes parallelization more difficult to achieve (in fact in the current version I reduce shapes partially in individual cmt files and only perform full reduction when aggregating indexes to take advantage of caching).

Have you done any benchmark to compare performance with and without the indexing step?

Currently the ocaml-lsp project (80K lines) it takes 2.5s to build the index on my (M1 pro) machine when the project has already been built without the index. The difference when comparing a completely from-scratch build with and without the index is smaller: around 1.5s of a total of 10s.

An incremental rebuild of the index after a one-file change takes around 0.2s more than a rebuild without the indexing.

Could it make sense to do a first implementation of this feature without the indexing step

I find the current implementation quite satisfying (right now I am cleaning things up and about to un-draft the PR) and it works quite well in term of timings. It would require quite a lot of changes to have Merlin directly build the index and would probably be much slower without Dune's incredible parallelism and incrementality.

rgrinberg · 2023-01-12T17:02:45Z

The main pro of the "individual files" method is that it plays well with Dune's (and lot of other build systems) file-based model: this allows the build system to maximize parallelism and incrementality at the cost of bigger artifacts. It also fits closely OCaml separate compilation model so indexing can start when individual cmt files are ready. For cases when the simplicity+time vs size compromise is not viable I plan to add support for different indexing strategies in the indexing tool itself (like calling the tool once for every cmt, or using a database as a store).

Yes, that's indeed the case why we usually lean towards files. But I find that to be a crutch and a sorely missing feature in dune. With a proper database, we would be losing incrementality and parallelism only at the stage when we serialize the write into the database. The actual indexing operations would be retaining all those good properties. Nevertheless, using a database isn't realistic until we have a database that supports something like dune's "stale artifact deletion" for records.

I find the current implementation quite satisfying (right now I am cleaning things up and about to un-draft the PR) and it works quite well in term of timings. It would require quite a lot of changes to have Merlin directly build the index and would probably be much slower without Dune's incredible parallelism and incrementality.

It seems concerning that merlin is so tied to a particular implementation of the database index. Would it be possible to make sure that the db is a separate component with a well defined API?

voodoos · 2023-01-12T17:16:10Z

The "satisfying" part was the Dune rules :-)

So far the format (basically a marshalled map) of the index is defined with some magic numbers and a shared module between Merlin and the indexing binary.

One way to decouple this would be to define a small rpc (that doesn't need to run in the background) for Merlin to query the indexing binary for occurrences. It would be easier to extend the protocol if we want to index more thing. Is that the kind of API you had in mind ?

rgrinberg · 2023-01-12T17:53:35Z

It doesn't need to be RPC, a shared module should be enough. It just needs to encapsulate the details of marshalling and IO so that they only live in one place.

Signed-off-by: Ulysse Gérard <[email protected]>

Index generation opt-in with workspace option This means that monorepo are fully indexed if the host workspace has indexation activated. Signed-off-by: Ulysse Gérard <[email protected]>

Signed-off-by: Ulysse Gérard <[email protected]>

ELLIOTTCABLE · 2023-11-08T16:39:12Z

I find the current implementation quite satisfying (right now I am cleaning things up and about to un-draft the PR) and it works quite well in term of timings.

The remaining work to get this work merged, if I understand correctly, seems to be:

Abstract the index-access,
rebase/merge up to the latest of the various involved projects,
then get these PRs opened for review/merge?

Is there any way I can contribute? I truly believe this is one of the big things holding back the 'newbie experience' in the OCaml ecosystem, would love to help push it forward! 😅

voodoos · 2024-04-15T12:29:51Z

Closing in favor of #10422 that update these rules to the new process which involve the compiler pre-indexing cmt files.

rgrinberg added requires-team-discussion This topic requires a team discussion merlin labels Dec 22, 2022

voodoos force-pushed the ocaml-uideps branch from 3ade745 to 06c8984 Compare December 28, 2022 15:41

ulugbekna mentioned this pull request Jan 10, 2023

Global API search ocaml/merlin#61

Open

voodoos force-pushed the ocaml-uideps branch from d84425e to 9d21b4d Compare January 20, 2023 13:50

voodoos added 2 commits June 19, 2023 15:50

Add a test showing that S dirs of implicit deps are not sent to Merlin

943851c

Signed-off-by: Ulysse Gérard <[email protected]>

Add all dependency source dirs to Merlin's configuration.

8decbd4

Signed-off-by: Ulysse Gérard <[email protected]>

voodoos force-pushed the ocaml-uideps branch from 2dc84ff to def28b1 Compare June 19, 2023 14:19

voodoos added 9 commits June 19, 2023 16:20

feature: Generate uideps files for all modules and projects

13e20e5

Index generation opt-in with workspace option This means that monorepo are fully indexed if the host workspace has indexation activated. Signed-off-by: Ulysse Gérard <[email protected]>

wip: Add Dune build dir to Merlin's configuration

0bb3b26

Signed-off-by: Ulysse Gérard <[email protected]>

tests: update tests with new BUILD_DIR field in merlin config

54ee66e

Signed-off-by: Ulysse Gérard <[email protected]>

test: Add tests for uideps generation and aggregation

ae87713

Signed-off-by: Ulysse Gérard <[email protected]>

Add ocaml-uideps as a test dependency and remove form opam file

2194ad8

test: Add a vendored lib

3c0adcb

Only index public libs from vendored projects

f038c23

test: add private vendored lib

0035eca

Remove debug or stale comments

5d9e0b4

voodoos force-pushed the ocaml-uideps branch from def28b1 to 5d9e0b4 Compare June 19, 2023 14:21

Rename test

67d6cb0

voodoos added 2 commits June 19, 2023 16:25

Promote updated tests

7bf20a4

Pass index file path to Merlin

ed12165

voodoos mentioned this pull request Apr 15, 2024

[5.2] Build occurrences index for Merlin #10422

Merged

voodoos closed this Apr 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Support for indexing the codebase #6762

[WIP] Support for indexing the codebase #6762

voodoos commented Dec 22, 2022

rgrinberg commented Dec 26, 2022

voodoos commented Dec 29, 2022 •

edited

Loading

rgrinberg commented Dec 30, 2022

voodoos commented Dec 30, 2022 •

edited

Loading

ELLIOTTCABLE commented Jan 10, 2023 •

edited

Loading

sidkshatriya commented Jan 11, 2023

nojb commented Jan 11, 2023

voodoos commented Jan 11, 2023 •

edited

Loading

rgrinberg commented Jan 12, 2023

voodoos commented Jan 12, 2023 •

edited

Loading

rgrinberg commented Jan 12, 2023

ELLIOTTCABLE commented Nov 8, 2023

voodoos commented Apr 15, 2024

[WIP] Support for indexing the codebase #6762

[WIP] Support for indexing the codebase #6762

Conversation

voodoos commented Dec 22, 2022

rgrinberg commented Dec 26, 2022

voodoos commented Dec 29, 2022 • edited Loading

rgrinberg commented Dec 30, 2022

voodoos commented Dec 30, 2022 • edited Loading

ELLIOTTCABLE commented Jan 10, 2023 • edited Loading

sidkshatriya commented Jan 11, 2023

nojb commented Jan 11, 2023

voodoos commented Jan 11, 2023 • edited Loading

rgrinberg commented Jan 12, 2023

voodoos commented Jan 12, 2023 • edited Loading

rgrinberg commented Jan 12, 2023

ELLIOTTCABLE commented Nov 8, 2023

voodoos commented Apr 15, 2024

voodoos commented Dec 29, 2022 •

edited

Loading

voodoos commented Dec 30, 2022 •

edited

Loading

ELLIOTTCABLE commented Jan 10, 2023 •

edited

Loading

voodoos commented Jan 11, 2023 •

edited

Loading

voodoos commented Jan 12, 2023 •

edited

Loading